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Abstract 



Parametric policy search algorithms are one of the methods of choice for the opti- 
misation of Markov Decision Processes, with Expectation Maximisation and nat- 
ural gradient ascent being considered the current state of the art in the field. In 
this article we provide a unifying perspective of these two algorithms by showing 
that their search-directions in the parameter space are closely related to the search- 
direction of an approximate Newton method. This analysis leads naturally to the 
consideration of this approximate Newton method as an alternative optimisation 
method for Markov Decision Processes. We are able to show that the algorithm 
has numerous desirable properties, absent in the naive application of Newton's 
method, that make it a viable alternative to either Expectation Maximisation or 
natural gradient ascent. Empirical results suggest that the algorithm has excellent 
convergence and robustness properties, performing strongly in comparison to both 
Expectation Maximisation and natural gradient ascent. 

1 Markov Decision Processes 

Markov Decision Processes (MDPs) are the most commonly used model for the description of se- 
quential decision making processes in a fully observable environment, see e.g. [5 |. A MDP is 
described by the tuple {S , A, H,pi,p,Tr, R}, where S and A are sets known respectively as the 
state and action space, H £ N is the planning horizon, which can be either finite or infinite, and 
{pi , J3, TT, R} are functions that are referred as the initial state distribution, transition dynamics, pol- 
icy (or controller) and the reward function. In general the state and action spaces can be arbitrary 
sets, but we restrict our attention to either discrete sets or subsets of M", where n E N. We use 
boldface notation to represent a vector and also use the notation z — {s, a) to denote a state-action 
pair. Given a MDP the trajectory of the agent is determined by the following recursive procedure: 
Given the agent's state, St, at a given time-point, t E Nh, an action is selected according to the pol- 
icy, at ^ 7r(-|st); The agent will then transition to a new state according to the transition dynamics, 
■St+i ~ St); this process is iterated sequentially through all of the time-points in the plan- 

ning horizon, where the state of the initial time-point is determined by the initial state distribution 
■Si ^ Pi{')- At each time-point the agent receives a (scalar) reward that is determined by the reward 
function, where this function depends on the current action and state of the environment. Typically 
the reward function is assumed to be bounded, but as the objective is linear in the reward function 
we assume w.l.o.g that it is non-negative. 

The most widely used objective in the MDP framework is to maximise the total expected reward of 
the agent over the course of the planning horizon. This objective can take various forms, including 
an infinite planning horizon, with either discounted or average rewards, or a finite planning horizon. 
The theoretical contributions of this paper are applicable to all three frameworks, but for notational 
ease and for reasons of space we concern ourselves with the infinite horizon framework with dis- 
counted rewards. In this framework the boundedness of the objective function is ensured by the 
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introduction of a discount factor, 7 e [0, 1), which scales the rewards of the various time-points in a 
geometric manner. Writing the objective function and trajectory distribution directly in terms of the 
parameter vector then, for any w E W the objective function takes the form 



7*-ii?(a,s) 



(1) 



where we have denoted the parameter space by W and have used the notation pt{a, s; w) to repre- 
sent the marginal p{st ^s,at=a;w) of the joint state-action trajectory distribution 

p{ai.,H,Si.,H--,w) ^ niaHlsH-yW)! Yl Pi^t+i\at, St)T:{at\st]w)lpi{si), HeN. (2) 

Note that the policy is now written in terms of its parametric representation, TT{a\s;w). 

It is well known that the global optimum of ([T} can be obtained through dynamic programming, see 
e.g. 0. However, due to various issues, such as prohibitively large state-action spaces or highly 
non-linear transition dynamics, it is not possible to find the global optimum of ([T]l in most real-world 
problems of interest. Instead most research in this area focuses on obtaining approximate solutions, 
for which there exist numerous techniques, such as approximate dynamic programming methods ||6|, 
Monte-Carlo tree search methods II 191 and policy search methods, both parametric ll27l 1211 1161 ITSi 
and non-parametric [2 , 25|. 

This work is focused solely on parametric poUcy search methods, by which we mean gradient-based 
methods, such as steepest and natural gradient ascent ll23l [n. along with Expectation Maximisation 
ifTlJ . which is a bound optimisation technique from the statistics literature. Since their introduction 
lfT4l [311 [TOl [T6l these methods have been the centre of a large amount of research, with much of it 
focusing on gradient estimation |21 , 4|, variance reduction techniques fSO'TSl, function approxima- 
tion techniques |27, 8 20 1 and real-world applications |18, 26 1. While steepest gradient ascent has 
enjoyed some success it is known to suffer from some substantial issues that often make it unattrac- 
tive in practice, such as slow convergence and poor scaling |23 1. Various optimisation methods have 
been introduced as an alternative, most notably natural gradient ascent lfT6l l24l |3l and Expectation 
Maximisation |18 , 28 1, which are currently considered the current state of the art in the field of 
parametric policy search algorithms. In this paper our primary focus is on the search-direction (in 
the parameter space) of these two methods. 

2 Search Direction Analysis 

In this section we will perform a novel analysis of the search-direction of both natural gradient 
ascent and Expectation Maximisation. In gradient-based algorithms of Markov Decision Processes 
the update of the policy parameters take the form 

w"'''^ = w + aM{w)V^U{w), (3) 

where a £ M"*" is the step-size parameter and M{w) is some positive-definite matrix that possibly 
depends on w. It is well-known that such an update will increase the total expected reward, provided 
that a is sufficiently small, and this process will converge to a local optimum of Q provided the 
step-size sequence is appropriately selected. While EM doesn't have an update of the form given 
in ([3]l we shall see that the algorithm is closely related to such an update. It is convenient for later 
reference to note that the gradient Wu}U{w) can be written in the following form 



V„ logTr{a\s;w) 



(4) 



where we use the expectation notation E[ ] to denote the integral/summation w.r.t. a non-negative 
function. The term p^{z; w) is a geometric weighted average of state-action occupancy marginals 
given by 

00 

Pjiz;w) = ^7*-V(2;;i^), 

t=l 

while the term Q{z;w) is referred to as the state-action value function and is equal to the total 
expected future reward from the current time-point onwards, given the current state-action pair, z, 
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and parameter vector, w, i.e. 



Q{z;w) = ^Ep^(^/.^ 



Zi = z 



This is a standard result and due to reasons of space we have omitted the details, but see e.g. ll27l or 
section! 6.1 1 of the supplementary material for more details. 



An immediate issue concerning updates of the form ^ is in the selection of the 'optimal' choice 
of the matrix A4{w), which clearly depends on the sense in which optimality is defined. There 
are numerous reasonable properties that are desirable of such an update, including the numerical 
stability and computational complexity of the parameter update, as well as the rate of convergence 
of the overall algorithm resulting from these updates. While all reasonable criteria the rate of con- 
vergence is of such importance in an optimisation algorithm that it is a logical starting point in our 
analysis. For this reason we concern ourselves with relating these two parametric policy search al- 
gorithms to the Newton method, which has the highly desirable property of having a quadratic rate 
of convergence in the vicinity of a local optimum. The Newton method is well-known to suffer from 
problems that make it either infeasible or unattractive in practice, but in terms of forming a basis for 
theoretical comparisons it is a logical starting point. We shall discuss some of the issues with the 
Newton method in more detail in section([3|. In the Newton method the matrix M.{w) is set to the 
negative inverse Hessian, i.e. 

M{w) = -H-\w), where Hiw) = V^„V^C/(t(j). 

where we have denoted the Hessian by T-L{w). Using methods similar to those used to calculate the 
gradient, it can be shown that the Hessian takes the form 



where 



oo 

'H2{w) = ^Ep(^^^^.^ 



t=l 



7* '^R{zt)V^ \ogp{zi.,t\ w)Vl, \ogp{zi.,t\w) 



-f'-^R{zt)V^Vl\ogp{zi.,t;w) 



(5) 



(6) 



(7) 



We have omitted the details of the derivation, but these can be found in section! 6. 2 1 of the sup- 
plementary material, with a similar derivation of a sample-based estimate of the Hessian given in 



2.1 Natural Gradient Ascent 

To overcome some of the issues that can hinder steepest gradient ascent an alternative, natural 
gradient, was introduced in 1 16|. Natural gradient ascent techniques originated in the neural network 
and blind source separation literature, see e.g. ||T|, and take the perspective that the parameter space 
has a Riemannian manifold structure, as opposed to a Euclidean structure. Deriving the steepest 
ascent direction of U{w) w.r.t. a local norm defined on this parameter manifold (as opposed to w.r.t. 
the Euclidean norm, which is the case in steepest gradient ascent) results in natural gradient ascent. 
We denote the quadratic form that induces this local norm on the parameter manifold by G{w), i.e. 
d{wy' — G{w)w. The derivation for natural gradient ascent is well-known, see e.g. [JJ, and its 
application to the objective ([T]) results in a parameter update of the form 

In terms of ([sjl this corresponds to Ai{w) = G^^{w). In the case of MDPs the most commonly 
used local norm is given by the Fisher information matrix of the trajectory distribution, see e.g. 
and due to the Markovian structure of the dynamics it is given by 



G{w) 



-E 



p_,(z;ui) 



V^V!^ log7r(a|s;u;) 



(8) 



We note that there is an alternate, but equivalent, form of writing the Fisher information matrix, see 
^■g- ll24l . but we do not use it in this work. 
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In order to relate natural gradient ascent to the Newton method we first rewrite the matrix (j7]i into 
the following form 



1-L2{w) = E 



V„V^log7r(a|s;u;) 



(9) 



For reasons of space the details of this reformulation of ^\ are left to section( [6^ of the supplemen- 
tary material. Comparing the Fisher information matrix |8]l with the form of il^w) given in (j9| it is 
clear that natural gradient ascent has a relationship with the approximate Newton method that uses 
7^2(1^') in place of ^{w). In terms of ([3| this approximate Newton method corresponds to setting 
/^{w) = —T-L^^^w). In particular it can be seen that the difference between the two methods lies 
in the non-negative function w.r.t. which the expectation is taken in ([8]l and (j9]l. (It also appears 
that there is a difference in sign, but observing the form of M.{w) for each algorithm shows that 
this is not the case.) In the Fisher information matrix the expectation is taken w.r.t. to the geometri- 
cally weighted summation of state-action occupancy marginals of the trajectory distribution, while 
in 7i2{w) there is an additional weighting from the state-action value function. Hence, H2{w) 
incorporates information about the reward structure of the objective function, whereas the Fisher 
information matrix does not, and so it will generally contain more information about the curvature 
of the objective function. 



2.2 Expectation Maximisation 

The Expectation Maximisation algorithm, or EM-algorithm, is a powerful optimisation technique 
from the statistics literature, see e.g. ifTTl . that has recently been the centre of much research in 
the planning and reinforcement learning communities, see e.g. 11011281 118 1. A quantity of central 
importance in the EM-algorithm for MDPs is the following lower-bound on the log-objective 



l0gU{w) > 'Hentropy(g(^l:t,t)) + Egi^^^,^^t) 



log 7* ^R{zt)p{zi.,t;w) 



(10) 



where Hentiopy is the entropy function and q{zi-t,t) is known as the 'variational distribution'. Further 
details of the EM-algorithm for MDPs and a derivation of ( [T0| are given in section( [63] l of the 
supplementary material or can be found in e.g. 1 18 28 1. The parameter update of the EM-algorithm 
is given by the maximum (w.r.t. w) of the 'energy' term. 



p^{z:Wk)Q{z;Wk) 



log7r(a|s; w) 



(11) 



Note that Q is a two-parameter function, where the first parameter occurs inside the expectation 
and the second parameter defines the non-negative function w.rt. the expectation is taken. This 
decoupling allows the maximisation over w to be performed explicitly in many cases of interest. 
For example, when the log-policy is quadratic in w the maximisation problems is equivalent to a 
least-squares problem and the optimum can be found through solving a linear system of equations. 

It has previously been noted, again see e.g. |18|, that the parameter update of steepest gradient 
ascent and the EM-algorithm can be related through this 'energy' term. In particular the gradient 
Q evaluated at can also be written as follows '^w\u,=wk U (w) = V^|^^^^ Qiw, w^), where 
we use the notation to denote the first derivative w.rt. the first parameter, while the update of 
the EM-algorithm is given by Wk+i ~ argmax.^ygyy Q{w, Wk). In other words, steepest gradient 
ascent moves in the direction that most rapidly increases Q{w, Wk) w.rt. the first variable, while the 
EM-algorithm maximises Q{w, Wk) w.rt. the first variable. While this relationship is true, it is also 
quite a negative result. It states that in situations where it is not possible to explicitly perform the 
maximisation over w in ( [TT] ) then the alternative, in terms of the EM-algorithm, is this generalised 
EM-algorithm, which is equivalent to steepest gradient ascent. Considering that algorithms such as 
EM are typically considered because of the negative aspects related to steepest gradient ascent this 



is an undesirable alternative. It is possible to find the optimum of ( 1 1 1 numerically, but this is also 
undesirable as it results in a double-loop algorithm that could be computationally expensive. Finally, 
this result provides no insight into the behaviour of the EM-algorithm, in terms of the direction of 



its parameter update, when the maximisation over in ( 1 1 1 can be performed explicitly. 



Instead we provide the following result, which shows that the step-direction of the EM-algorithm 
has an underlying relationship with the Newton method. In particular we show that, under suitable 
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regularity conditions, the direction of the EM-update, i.e. Wk+i — w^, is the same, up to first order, 
as the direction of an approximate Newton method that uses 'H2{w) in place of H{w). 

Theorem 1. Suppose we are given a Markov Decision Process with objective ([7]) and Markovian 
trajectory distribution (|2]l. Consider the update of the parameter through Expectation Maximisation 
at the fc* iteration of the algorithm, i.e. 

it>fc+i = argmax.^gyy Q{w,Wk). 

Provided that Q{w, Wk) is twice continuously differentiate in the first parameter we have that 

Wk+l -Wk = -'H2^i'^k)'^-w\-w=-wtU{w) + 0{\\Wk+l - WkW^). (12) 

Additionally, in the case where the log-policy is quadratic the relation to the approximate Newton 
method is exact, i.e. the second term on the r.h.s. \12\ is zero. 

Proof. The idea of the proof is simple and only involves performing a Taylor expansion of 
'^wQ{'w,Wk). As Q is assumed to be twice continuously differentiable in the first component 
this Taylor expansion is possible and gives 

V]^Q{wk+i,Wk)^Vl^Q{wk,Wk)+V^^Q{wk,Wk)iwk+i-Wk)+0{\\wk+i~Wk\^^^^ (13) 

As Wk+i = argmax^gyy Q{w, Wk) it follows that Wl^Q{wk+i,Wk) = 0. This means that, upon 
ignoring higher order terms in Wk+i — Wk, the Taylor expansion ( [T3| ) can be rewritten into the form 

Wk+i-Wk = -y''^Q{wk,Wk)-^V'^Q{wu,Wk). (14) 

The proof is completed by observing that V^^Qiw^^Wk) = ■w\w=vakU{w) and 
V^Q(t(;fe, WfS) = 'H2{wk). The second statement follows because in the case where the Zog-pohcy 
is quadratic the higher order terms in the Taylor expansion vanish. □ 

2.3 Summary 

In this section we have provided a novel analysis of both natural gradient ascent and Expectation 
Maximisation when applied to the MDP framework. Previously, while both of these algorithms have 
proved popular methods for MDP optimisation, there has been little understanding of them in terms 
of their search-direction in the parameter space or their relation to the Newton method. Firstly, our 
analysis shows that the Fisher information matrix, which is used in natural gradient ascent, is similar 
to 'H2{w) in pll with the exception that the information about the reward structure of the problem 
is not contained in the Fisher information matrix, while such information is contained in 'H2iw). 
Additionally we have shown that the step-direction of the EM-algorithm is, up to first order, an 
approximate Newton method that uses H2{w) in place of H{w) and employs a constant step-size 
of one. 

3 An Approximate Newton Method 

A natural follow on from the analysis in section(j2|l is the consideration of using M{w) = -n^^w) 
in ([3]l, a method we call the full approximate Newton method from this point onwards. In this section 
we show that this method has many desirable properties that make it an attractive alternative to other 
parametric policy search methods. Additionally, denoting the diagonal matrix formed from the 
diagonal elements of H2{w) by V2{'u>), we shall also consider the method that uses M{w) = 
—'D2^{w) in (jsji. We call this second method the diagonal approximate Newton method. 

Recall that in ^ it is necessary that M{w) is positive-definite (in the Newton method this corre- 
sponds to requiring the Hessian to be negative-definite) to ensure an increase of the objective. In 
general the objective ([T]i is not concave, which means that the Hessian will not be negative-definite 
over the entire parameter space. In such cases the Newton method can actually lower the objective 
and this is an undesirable aspect of the Newton method. An attractive property of the approximate 
Hessian, 'H2{w), is that it is always negative-definite when the policy is log-concave in the policy 
parameters. This fact follows from the observation that in such cases 7i2{'w) is a non-negative mix- 
ture of negative-definite matrices, which again is negative-definite Additionally, the diagonal 
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terms of a negative-definite matrix are negative and so 'D2{w) is also negative-definite when the 
controller is log-concave. 

To motivate this result we now briefly consider some widely used policies that are either log-concave 
or blockwise log-concave. Firstly, consider the Gibb's policy, 7r(a|s; w) cx exp xu^ 4>{a, s), where 
4>{a, s) E M""" is a feature vector. This policy is widely used in discrete systems and is log- 
concave in w, which can be seen from the fact that log7r(a|s; w) is the sum of a linear term and 
a negative log-sum-exp term, both of which are concave [9J. In systems with a continuous state- 
action space a common choice of controller is 7r(a|s; tOmean, = M{a\K4>{s) + m, where 
'^mean = {K , m} and cf){s) e M"'" is a feature vector. The notation I](s) is used because there 
are cases where is it beneficial to have state dependent noise in the controller. This controller is not 
jointly log-concave in tOmean and S, but it is blockwise log-concave in lOmean and S^^. In terms of 
Wrasw the log-policy is quadratic and the coefficient matrix of the quadratic term is negative-definite. 
In terms of S^^ the log-policy consists of a linear term and a log-determinant term, both of which 
are concave. 

In terms of evaluating the search direction it is clear from the forms of 'D2{w) and H2{'w) that 
many of the pre-existing gradient evaluation techniques can be extended to the approximate Newton 
framework with little difficulty. In particular, gradient evaluation requires calculating the expectation 
of the derivative of the log-policy w.r.t. p^{z; w)Q{z; w). In terms of inference the only additional 
calculation necessary to implement either the fall or diagonal approximate Newton methods is the 
calculation of the expectation (w.r.t. to the same function) of the Hessian of the log-policy, or its 



diagonal terms. As an example in section! 6.5 1 of the supplementary material we detail the extension 
of the recurrent state formulation of gradient evaluation in the average reward framework, see e.g. 
||3ll, to the approximate Newton method. We use this extension in the Tetris experiment that we 
consider in sectionQ. Given Ug samples and parameters the complexity of these extensions 
scale as 0{nsnw) for the diagonal approximate Newton method, while it scales as ©(n^n^) for the 
full approximate Newton method. 

An issue with the Newton method is the inversion of the Hessian matrix, which scales with 0{n^) in 
the worst case. In the standard application of the Newton method this inversion has to be performed 
at each iteration and in large parameter systems this becomes prohibitively costly. In general H{w) 
will be dense and no computational savings will be possible when performing this matrix inversion. 
The same is not true, however, of the matrices 'D2{w) and H2{w). Firstly, as 2?2(if) is a diagonal 
matrix it is trivial to invert. Secondly, there is an immediate source of sparsity that comes from 
taking the second derivative of the log-trajectory distribution in (j7]i. This property ensures that any 
(product) sparsity over the control parameters in the log-trajectory distribution will correspond to 
sparsity in 'H2{w). For example, in a partially observable Markov Decision Processes where the 
policy is modeled through a finite state controller, see e.g. \22\, there are three functions to be 
optimised, namely the initial belief distribution, the belief transition dynamics and the policy. When 
the parameters of these three functions are independent H2{w) will be block-diagonal (across the 
parameters of the three functions) and the matrix inversion can be performed more efficiently by 
inverting each of the block matrices individually. The reason that T-L{w) does not exhibit any such 
sparsity properties is due to the term 'Hi{w) in (jsj), which consists of the non-negative mixture of 
outer-product matrices. The vector in these outer-products is the derivative of the log-trajectory 
distribution and this typically produces a dense matrix. 

A undesirable aspect of steepest gradient ascent is that its performance is affected by the choice of 
basis used to represent the parameter space. An important and desirable property of the Newton 
method is that it is invariant to non-singular linear (affine) transformations of the parameter space, 
see e.g. [9] . This means that given a non-singular linear (affine) mapping T € K"'" ^ , the Newton 
update of the objective U{w) — U{Tw) is related to the Newton update of the original objective 
through the same linear (affine) mapping, i.e. v + At^nt = T{w + AiOnt), where v ~ Tw and 
Avnx and Aw^i denote the respective Newton steps. In other words running the Newton method on 
U{w) and U{T~^w) will give identical results. An important point to note is that this desirable 
property is maintained when using T-L2{w) or 'D2{'w) in an approximate Newton method. This can 
be seen by using the linearity of the expectation operator and a proof of this statement is provided 



in section! 6.4 1 of the supplementary material. 
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(a) Policy Trace (b) Tetris Problem 

Figure 1: (a) An empirical illustration of the affine invariance of the approximate Newton method, 
performed on the two state MDP of lfT6ll . The plot shows the trace of the policy during training 
for the two different parameter spaces, where the results of the latter have been mapped back into 
the original parameter space for comparison. The plot shows the two steepest gradient ascent traces 
(blue cross and blue circle) and the two traces of the approximate Newton method (red cross and red 
circle), (b) Results of the tetris problem for steepest gradient ascent (black), natural gradient ascent 
(green), the diagonal approximate Newton method (blue) and the approximate Newton method (red). 



4 Experiments 

The first experiment we performed was an empirical illustration that the approximate Newton 
method is invariant to linear transformations of the parameter space. We considered the simple 
two state example of lfT6l as it allows us to plot the trace of the policy during training, since the 
policy has only two parameters. The policy was trained using both steepest gradient ascent and the 
approximate Newton method and in both the original and linearly transformed parameter space. The 
policy traces of the two algorithms are plotted in figure! T[a i. As expected steepest gradient ascent is 



affected by such mappings, whilst the approximate Newton method is invariant to them. 

The second experiment was aimed at demonstrating the scalability of the full and diagonal approxi- 
mate Newton methods to problems with a large state space. We considered the tetris domain, which 
is a popular computer game designed by Alexey Pajitnov in 1985. See 1 12 1 for more details. Firstly, 
we compared the performance of the full and diagonal approximate Newton methods to other para- 
metric policy search methods. Tetris is typically played on a 20 x 10 grid, but due to computational 
costs we considered a 10 x 10 grid in the experiment. This results in a state space with roughly 
7 X 2^"° states. We modelled the policy through a Gibb's distribution, where we considered a fea- 
ture vector with the following features: the heights of each column, the difference in heights between 
adjacent columns, the maximum height and the number of 'holes'. Under this policy it is not possi- 



ble to obtain the explicit maximum over it) in ( 1 1 1 and so a straightforward application of EM is not 



possible in this problem. We therefore compared the diagonal and full approximate Newton methods 
with steepest and natural gradient ascent. Due to reasons of space the exact implementation of the 



experiment is detailed in section! 6.6 1 of the supplementary material. We ran 100 repetitions of the 
experiment, each consisting of 100 training iterations, and the mean and standard error of the results 
are given in figure! Tjbi. It can be seen that the full approximate Newton method outperforms all of 



the other methods, while the performance of the diagonal approximate Newton method is compa- 
rable to natural gradient ascent. We also ran several training runs of the full approximate Newton 
method on the full-sized 20 x 10 board and were able to obtain a score in the region of 14, 000 
completed lines, which was obtained after roughly 40 training iterations. An approximate dynamic 
programming based method has previously been applied to the Tetris domain in |7|. The same set 
of features were used and a score of roughly 4, 500 completed lines was obtained after around 6 
training iterations, after which the solution then deteriorated. 

In the third experiment we considered a finite horizon (controlled) linear dynamical system. This 
allowed the search-directions of the various algorithms to be computed exactly using [13] and re- 
moved any issues of approximate inference from the comparison. In particular we considered a 
3-Iink rigid manipulator, linearized through feedback linearisation, see e.g. ifTTll . This system has a 
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Training Time Training Iterations 

(a) Model-Based Linear System (b) Model-Free Non-Linear System 

Figure 2: (a) The normalised total expected reward plotted against training time, in seconds, for the 
3-link rigid manipulator. The plot shows the results for steepest gradient ascent (black), EM (blue), 
natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the 
mean and standard error of the results, (b) The normalised total expected reward plotted against 
training iterations for the synthetic non-linear system of ||291 . The plot shows the results for EM 
(blue), steepest gradient ascent (black), natural gradient ascent (green) and the approximate Newton 
method (red), where the plot shows the mean and standard error of the results. 



6-dimensional state space, 3-dimensional action space and a 22-dimensional parameter space. Fur- 
ther details of the system can be found in section! 6.7 i of the supplementary material. We ran the 
experiment 100 times and the mean and standard error of the results plotted in figure! 2[ai. In this 
experiment the approximate Newton method found substantially better solutions than either steep- 
est gradient ascent, natural gradient ascent or Expectation Maximisation. The superiority of the 
results in comparison to either steepest or natural gradient ascent can be explained by the fact that 
7^2 (lA') gives a better estimate of the curvature of the objective function. Expectation Maximisation 
performed poorly in this experiment, exhibiting sub-linear convergence. Steepest gradient ascent 
performed 3684 ± 314 training iterations in this experiment which, in comparison to the 203 ± 34 
and 310 ± 40 iterations of natural gradient ascent and the approximate Newton method respectively, 
illustrates the susceptibility of this method to poor scaling. In the final experiment we considered the 
synthetic non-linear system considered in |29|. Full details of the system and the experiment can be 
found in section! 6.8 1 of the supplementary material. We ran the experiment 100 times and the mean 
and standard error of the results are plotted in figure! [2]b| . Again the approximate Newton method 
outperforms both steepest and natural gradient ascent. In this example only the mean parameters of 
the Gaussian controller are optimised, while the parameters of the noise are held fixed, which means 
that the log-policy is quadratic in the policy parameters. Hence, in this example the EM-algorithm 
is a particular (less general) version of the approximate Newton method, where a fixed step-size 
of one is used throughout. The marked difference in performance between the EM-algorithm and 
the approximate Newton method shows the benefit of being able to tune the step-size sequence. 
In this experiment we considered five different step-size sequences for the approximate Newton 
method and all of them obtained superior results than the EM-algorithm. In contrast only one of 
the seven step-size sequences considered for steepest and natural gradient ascent outperformed the 
EM-algorithm. 



5 Conclusion 



The contributions of this paper are twofold: Firstly we have given a novel analysis of Expectation 
Maximisation and natural gradient ascent when applied to the MDP framework, showing that both 
have close connections to an approximate Newton method; Secondly, prompted by this analysis 
we have considered the direct application of this approximate Newton method to the optimisation of 
MDPs, showing that it has numerous desirable properties that are not present in the naive application 
of the Newton method. In terms of empirical performance we have found the approximate Newton 
method to perform consistently well in comparison to EM and natural gradient ascent, highlighting 
its viability as an alternative to either of these methods. At present we have only considered actor 
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type implementations of the approximate Newton method and the extension to actor-critic methods 
is a point of future research. 
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6 Supplementary Material 



6.1 Gradient Derivation 



For ease of reference in this section we give a brief outline of the derivation for the derivative of 
([TJ. As in the rest of the paper we focus on the case of an infinite planning horizon with discounted 
rewards, where other frameworks follow similarly. The first point to note is that for any < e N we 
have the following identity, often referred to as the 'log-trick', 

rup{zi.,t; w) = p{zi.,t; w)S/ \ogp{zi.,t; w), 

where the usual limit arguments are made in the case p{zi.t; w) — 0. Upon interchanging the order 
of integration and differentiation the gradient takes the form 



7* ^R{zt)V^\ogp{zi.,uw) 



Due to the Markovian structure of the trajectory distribution Q this derivative can be written in the 
equivalent form 



oo oo 



7* ^R{zt)V^\ogp{ar\Sr\w) 



7* ^R{zt)V^\ogp{ar\sr;w) 



where the second line follows from the first through an interchange of the summations. The chain 
structure of the trajectory distribution allows the expectation over the marginals of the two time- 
points of the trajectory distribution to be written as follows 



{z'\w) 



-1^-^R{z')\z^ = z 



\ogp{a\s;w) 



(15) 



where we have used the notation Pt{z\ w) = p{zr = z; w), for t e N. The summation over the 
inner expectation in ( 15 i can be seen to be equal to the state-action value function scaled by 7^~^ , 
i.e. 



Y-'Q{z;w)^J2^ 



pt{z'-w) 



J*~'^R{z')\zr = Z 



Inserting this form for this inner expectation into ( 15 i gives 



(z-w) 



T = l 
E 



Y ^Q[z;w)V^\ogp{a\s\w) 



p^l (z;w)Q{z:w) 



logp{a\s;w) 



where the second line follows from the definition of Pj{z; w). This completes the derivation of Q. 
This derivation is different, but equivalent, to the standard derivation in II27]| . 



6.2 Hessian Derivation 



Following similar calculations to those used section! 6.1 1 it is simple to see that the Hessian takes 
the form 



7 R{zt) V„ logp(2i:t; w)V^ logp{zi.,t;w)+V^V^ \ogp{zi.,t; w) 



where the first term is obtained by a second application of the 'log-trick' . This is the form of the 
Hessian given in (|5|. Theformof-H2(i«) given in d9|l follows from analogous calculations to those 
performed in section([6T|) of the supplementary material. 
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6.3 Expectation Maximisation Derivation 



For ease of reference in this section we give a brief derivation of the application of Expectation 
Maximisation to Markov Decision Processes. As in the rest of the paper we focus on the case of 
an infinite planning horizon with discounted rewards, where other frameworks follow similarly. A 
quantity of central importance in this derivation is the following non-negative function 

P{Zl:t,t] W) = '-f*^^R{Zt)p{Zi.,t-,w). 

This function is non-negative because the reward function is considered as non-negative. Note that 
this function is trans-dimensional, i.e. it is a function of both t E N and the state-action variables 
up until the time-point. It can be seen from (mi and (Ob that the 'normalisation constant' of 
p{zi;t, t; w) is equal to U{w), i.e. U{w) = J2't=iz2zi.t ^^i:*' ^' denote the normalised 

version of this function by p{zi.t,t; w). 

To obtain the lower-bound on the log-objective we note that the Kullback-Leibler divergence be- 
tween any two distributions is always non-negative and so calculating the Kullback-Leibler diver- 
gence between the variational distribution, q{zi-t, t), and p{zi-t,t; w) gives 



KL{q\\p) = \0gU{w) -nenaopy{q{zi:t,t)) - Eq{zi:t.t) 



log p{zi.,t,t;w) 



> 0. 



The lower-bound (lOi is now immediately obtained from the definition of p{zi.t,t; w). 



An EM-algorithm is obtained through coordinate-wise optimisation of the bound (10 1 w.r.t. the 
variational distribution (the E-step) and the policy parameters (the M-step). In the E-step the lower- 
bound (10 1 is optimised when the Kullback-Leibler divergence is zero, namely when q{zi-t,t) 



pizi-t, t; Wk), where Wk are the current policy parameters. In the E-step the lower-bound is opti- 
mised w.r.t. w, which is equivalent to optimising the quantity 

t 



logp{zi.,t,t;w) 
Using the definition of p{zi-t,t; Wk) we have 



^\0g1T{ar\Sr]w) 



terms independent of w. 



D{Zr ,Zt:Wk) 



7* ^R{zt)logTr{ar\sr;w) 



so that using the same marupulations as those used in section(6.1 1 of the supplementary material we 
have that the E-step is equivalent to maximising the function 



Q{w,Wk) =Ep^,(z:-w,)Q{z 



log7r(a|s; w) 



w.r.t. to the first parameter, w. This completes the derivation of the EM-algorithm for Markov 
Decision Processes with an infinite planning horizon and discounted rewards. 



6.4 Affine Invariance of Approximate Newton Metliod 

To show that the use of H2{w) in place of the true Hessian maintains the linear (affine) invariance 
of the Newton method we use the following formulae 

where / is some twice differentiable function of w, f{w) = f{Tw) and v = Tw. Using these 
formulae we have the following two identities 

Vu, logp(a|s; Tw) = T^V^ logp(a|s; v), 

ViuV^ logp(a|s; Tw) ^ T'^V^V^ logp(a|s; v)T, 

which hold for each (s, a) E S y. A. Defining U{w) ~ U (Tw ), we have WwU{w) = T^U{v). 
Following calculations almost identical to those in section(6.2i it can be shown that T-L2{w) takes 
the form 



E 



p^ (z:Tw)Q{z;Tw) 



V„V^logp(a|s;Tt(?) 
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which gives the following 



1-L2{w) — ^p^{z:rw)Q{z;r-u 



^p-,(z:Tw)Q(z-T-w) 



(z;Tw)Q(z-,Tw) 

■T 



V^V^ \ogp{a\s]Tw) 
T^V^Vllogp{a\s;v)T 



r, 



^T' n2{v)T. 

Using these two expressions we have that the parameter updates, under the approximate Newton 
method, of the objective functions U and U are related as follows 

■"new + a'H2{vy^^vU{v), 

= T{w + aH2{w)-^W^U{w)), 

where a is some step-size parameter This shows that the approximate Newton method is affine 
invariant. The affine invariance of the diagonal approximate Newton method follows similarly. 

6.5 Recurrent State Search Direction Evaluation for Approximate Newton Method 

In OTi a sampling algorithm was provided for estimating the gradient of an infinite horizon MDP 
with average rewards. This algorithm makes use of a recurrent state, which we denote by s* . In 
algorithm([T]) we detail a straightforward extension of this algorithm to the estimation the approxi- 
mate Hessian, T-L2{w), in this MDP framework. The analogous algorithm for the estimation of the 
diagonal matrix, T>2{w), follows similarly. In algorithmimi we make use of an eligibility trace for 
both the gradient and the approximate Hessian, which we denote by and respectively. The 
estimates (up to a positive scalar) of the gradient and the approximate Hessian are denoted by 
and A'^ respectively. 

6.6 Tetris Experiment Specification 

In this section we give a detailed specification of the procedure used for each of the algorithms in 
the tetris experiment. The same general procedure was used for all the algorithms considered in the 
experiments. We modelled the environment through with an infinite planning horizon with average 
rewards. The reward at each time-point is equal to the number of lines deleted. We used a recurrent 
state formulation 1 3 1 1 of the gradient of the average reward framework to perform the gradient evalu- 
ation. We used analogous versions of this recurrent state formulation for natural gradient ascent, the 



diagonal approximate Newton method and the full approximate Newton method. See section! 6.5 1 
of the supplementary material for more details. As in lfT6l we used the sample trajectories obtained 
during the gradient evaluation to estimate the Fisher information matrix. We used the empty board 
as a recurrent state. A new game starts with an empty board and so this too is a recurrent state. Irre- 
spective of the policy a game of tetris is guaranteed to terminate after a finite number of turns, see 
e.g. |7 1, and this guarantees the recurrence property of this state. During each training iteration the 
(approximation of the) search direction was obtained by sampling 1000 games, where these games 
were sampled using the current policy parameters. Given the current approximate search direction 
the following basic line search method was used to obtain a step-size: For every step-size in a given 
finite set of step-sizes sample a set number of games and then return the step-size with the maximal 
score over these games. In practice, in order to reduce the susceptibility to random noise, we used 
the same simulator seed for each possible step-size in the set. To avoid over-fitting a different sim- 
ulator seed was used during each training iteration. In this line search procedure we sampled 1000 
games for each of the possible step-sizes. The same set of step-sizes was used in all of the different 
training algorithms considered in section(|4]i, where we used the set 

{0.1, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0}. 

To reduce the amount of noise in the results we used the same set of simulator seeds in the search 
direction evaluation for each of the algorithms considered in sectionQ. In particular, we generated 
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Algorithm 1 Recurrent state sampling algorithm to estimate the search direction of the approximate 
Newton method when applied to a MDP with an infinite planning horizon with average rewards. 

Sample a state from the initial state distribution: 

si -_pi(-)- 

for t = 1, , N, for some N eN, do 

Given the current state, sample an action from the policy: 

at ^ ■7T{-\st;w). 

if St 7^ s* then 

Update the eligibility traces: 

*i ^ *i + V^log7r(at|sf;i(;) ^ *2 _^ V^V^ log 7r(at|st; u;) 

else 

reset the eligibility traces: 
end if 

Update the estimates of the gradient and the approximate Hessian: 



Sample state from the transition dynamics: 

St+i ^ p{-\at,st). 

end for 

Return the estimated gradient and approximate Hessian, which up to a positive scaling are given 
by A^ and A^ respectively. 



a riexperiments X ^T^iterations matrix of simulator sccds, where nexperiments was the number of repetitions 
of the experiment and niterations was the number of training iterations in each experiment. We then 
used this one matrix in all of the different training algorithms, where the element in the j* column 
and z* row corresponds to the simulator seed used in the training iteration of the experiment. 
In a similar manner the set of simulator seeds used for the line search procedure was the same for 
all of the different training algorithms. Finally, to make the line search consistent among all of the 
different training algorithms the search direction was normlised and the resulting unit vector was the 
vector used in the line search procedure. 

6.7 Linear System Specification 

The A^-link rigid robot arm manipulator is a standard continuous MDP model, consisting of an end 
effector connected to an A^-linked rigid body iflTl . A typical continuous control problem for such 
systems is to apply appropriate torque forces to the joints of the manipulator so as to move the end 
effector into a desired position. The state of the system is given by q, q, q E M^, where q, q and q 
denote the angles, velocities and accelerations of the joints respectively, while the control variables 
are the torques applied to the joints r e M^. The nonlinear state equations of the system are given 
by, see e.g. ITtI, 

M{q)q + C{q,q)q + g{q)=T (16) 

where M{q) is the inertia matrix, C{q, q) denotes the Coriolis and centripetal forces and g{q) is 
the gravitational force. While this system is highly nonlinear it is possible to define an appropriate 
control function T(q, q) that results in linear dynamics in a different state-action space. This process 
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is Ccilled feedback linearisation, see e.g. [17], and in the case of an A^-link rigid manipulator recasts 
the torque action space into the acceleration action space. This means that the state of the system 
is now given by q and q, while the control is a = qi. Ordinarily in such problems the reward 
would be a function of the generalised co-ordinates of the end effector, which results in a non-trivial 
reward function in terms of q, q and q. While this reward function can be modelled as a mixture of 
Gaussians for simplicity we consider the simpler problem where the reward is a function of q, q and 
q directly. 

In the experiments we considered a 3-link rigid manipulator, which results in a 9-dimensional state- 
action space and a 22-dimensional policy. In the experiment we discretised the continuous time dy- 
namics into time-steps of At — 0.1 and considered a finite planning horizon of H ^ 100. The mean 
of the initial state distribution was set to zero. The elements of the policy parameters m and tt^ were 
initialised randomly from the interval [—2, 2] and [1, 2] respectively. The matrix K was initialised 
to be zero on inter-link entries, while intra-link entries were initialised using rejection sampling. We 
sampled the parameters for each link independently from the set [—400, 40] x [—50, 10] and rejected 
the sample if the corresponding link was unstable. In the reward function the desired angle of each 
joint was randomly sampled from the interval [7r/4, 37r/4]. The covariance matrices of the initial 
state distribution and state transition dynamics were set to diagonals, where the diagonal elements 
were initialised randomly from the interval [0, 0.05]. The covariance matrix of the reward function 
was set to be a diagonal with all entries equal to 0.1. 

We used the minFuncj^optimisation library in all of the gradient-based algorithms. We found that 
both the line search algorithm and the step-size initialisation had an effect on the performance of all 
the algorithms. We therefore tried various combinations of these settings for each algorithm and se- 
lected the one that gave the best performance. We tried bracketing line search algorithms with: step- 
size halving; quadratic/cubic interpolation from new function values; cubic interpolation from new 
function and gradient values; step-size doubling and bisection; cubic interpolation/extrapolation 
with function and gradient values. We tried the following step-size initialisations; quadratic initial- 
ization using previous function value and new function value/gradient; twice the previous step-size. 
To handle situations where the initial policy parameterisation was in a 'flat' area of the parameter 
space far from any optima we set the function and point toleration of minFunc to zero for all al- 
gorithms. To handle the large number of training iterations required by steepest gradient ascent we 
increased the maximum number of function evaluations and training iterations to 10, 000. 



6.8 Non-Linear System Specification 

In section we detail the implementation of the experiment on the synthetic two-dimensional non- 
linear MDP considered in |29|. The state-space of the problem is two-dimensional, s = (s^, s^), 
where is the agent's position and is the agent's velocity. The control is one-dimensional and 
the dynamics of the system is given as follows 

Sf+i — si + 0.5 + K, 

1 + e^"* 

•St+l = St — O.lsl^i + K, 

where k is a zero-mean Gaussian random variable with standard deviation (7^ = 0.02. The agent 
starts in the state s = (0, 1), with the addition of Gaussian noise with standard deviation 0.001, 
and the objective is to transport the agent to the state (0,0). We use the same policy as in ||29J , 
which is given by at = (w + et)^ St, where w are the control parameters and ct ~ M{et; 0, Ug/). 
The objective function is non-trivial for w G [0, 60] x [—8, 0]. In the experiment the initial control 
parameters were sampled from the region Wq e [0, 60] x [—8, 0]. We considered a finite planning 
horizon with a planning horizon of H = 80. We used forward sampling to perform the inference, 
where in all algorithms 50 trajectories were sampled during each training iteration. 

We also detail the procedure used for the step-size tuning for the approximate Newton method and 
natural gradient ascent. The tuning of the step-size sequence in steepest gradient ascent was similar 
in nature to natural gradient ascent and so is omitted from the discussion. Using the intuition that 
the approximate Newton method has a natural step-size of one, which corresponds to an EM-step in 
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This software library is freely available at http://www.di.ens.fr/~mschmidt/Software/ 
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this problem because the log-policy is quadratic in the control parameters, we considered step-size 
sequences of the form ak = {1 — jf)oi + j^, where N is the total number of training iterations 
considered and a € M"*". In the experiment we considered the values a = 1, 6, 12, 18 and 24. 
The intuition used in this selection was that, provided that the steps were not so large as to cause 
overshooting in the parameter space, larger steps will increase performance. In steepest and natural 
gradient ascent we used step-size sequences that satisfied the Robbins-Munro conditions. In both of 
these methods it was necessary to obtain a gauge of a reasonable scale of a good step-size sequence. 
For this reason in the experiment we considered step-size sequences of the form Q^fe = ^ with 

a = 0.0001, 0.001, 0.01, 0.1, 1, 2 and 4. It was found that the sequence a = 18 gave the best results 
for the approximate Newton method, while the sequence a = 0.001 gave the best results for natural 
gradient ascent. 
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